Implement multi-site local sensitivity analysis workflow #1
divine7022 wants to merge 25 commits into ccmmf:main from
Conversation
> BTW this looks really nice, well done. Though I'd like to work through the steps as I review. Could you please add a readme with instructions on how to use the repository?

> thanks for reviewing, updated readme in #2
infotroph
left a comment
Just realized I left these comments unposted on my partial readthrough the other day, so adding now. Would it be useful for me to finish a full review now or wait for updates first?
scripts/012_aggregate_sensitivity.R
Outdated
```r
#----------------------------------
# read.settings() uses xmlToList, which loses duplicate <variable> tags,
# so read the XML directly
library(XML)
xml_doc <- XML::xmlParse(file.path("output", "pecan.CONFIGS.xml"))
sa_variables <- unique(XML::xpathSApply(
  xml_doc,
  "//sensitivity.analysis//variable",
  XML::xmlValue
))
XML::free(xml_doc)
```
Suggested change:

```diff
-#----------------------------------
-# read.settings() uses xmlToList, which loses duplicate <variable> tags, so read XML directly
-library(XML)
-xml_doc <- XML::xmlParse(file.path("output", "pecan.CONFIGS.xml"))
-sa_variables <- unique(XML::xpathSApply(
-  xml_doc,
-  "//sensitivity.analysis//variable",
-  XML::xmlValue
-))
-XML::free(xml_doc)
+sa_variables <- settings$sensitivity.analysis |>
+  (\(x) x[names(x) == "variable"])() |>
+  unlist() |>
+  unique()
```
```diff
@@ -0,0 +1,108 @@
+#!/usr/bin/env Rscript
```
Does this site selection process differ from what David already implemented in the downscaling repo? Seems preferable to use an existing tool rather than add that complexity to the UA process.
I have refactored this script to completely remove the clustering logic. It now acts purely as a consumer of the shared design_points.csv artifact (currently David's manually selected 198 sites) to perform the necessary PFT mapping and sub-sampling for UA.
We will treat this as the interface for now, and can revisit or refactor the integration logic once the broader coordination strategy between the Downscaling and Uncertainty workflows is finalized.
see the comment thread in #1
R/local_sensitivity.R
Outdated
```diff
@@ -0,0 +1,497 @@
+#' Aggregate local sensitivity results across sites
```
I'm not sure what "local" means here
Good catch regarding the ambiguity.
In this context, 'local' refers to the mathematical method used by PEcAn's variance-decomposition one-at-a-time (OAT) SA, where parameters are varied individually around their median to calculate partial derivatives (elasticities). This is distinct from the global sensitivity analysis we perform later.
I have updated the function documentation to explicitly state this aggregates one-at-a-time (OAT) parameter sensitivity results to avoid confusion with geographical locality.
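For intuition, here is a toy illustration of the OAT elasticity idea described above (not PEcAn's actual implementation; the model and values are made up): perturb one parameter around its median, estimate the local slope, and normalize it into a unitless elasticity.

```r
# Toy OAT elasticity sketch: for y = 3*p^2, the analytic elasticity
# (dy/dp) * p / y equals 2 everywhere, so the estimate should recover 2.
toy_model <- function(p) 3 * p^2          # stand-in for a model output
p_med <- 2                                 # parameter median
p_vals <- p_med * c(0.9, 1, 1.1)           # one-at-a-time perturbations
y_vals <- toy_model(p_vals)
slope <- coef(lm(y_vals ~ p_vals))[["p_vals"]]   # local dy/dp
elasticity <- slope * p_med / toy_model(p_med)   # (dy/dp) * (p / y)
```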
R/local_sensitivity.R
Outdated
```r
#' Aggregate local sensitivity results across sites
#'
#' @param sensitivity_outdir Directory containing PEcAn sensitivity outputs
#' @param design_points Data frame of site metadata
```
I recommend documenting what columns are expected in this df, and whether unexpected columns are ignored or cause errors
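One way to do both at once is a documented contract plus a small validator. The column names below are assumptions for illustration (chosen to match the design_points.csv usage elsewhere in this PR), not the function's actual schema:

```r
# Hypothetical column contract for design_points: required columns are
# checked, and unexpected extra columns are silently dropped rather than
# causing errors.
#' @param design_points Data frame of site metadata. Must contain columns
#'   `site_id`, `lat`, `lon`, and `pft`; additional columns are ignored.
validate_design_points <- function(design_points) {
  required <- c("site_id", "lat", "lon", "pft")
  missing <- setdiff(required, names(design_points))
  if (length(missing) > 0) {
    stop("design_points is missing columns: ", paste(missing, collapse = ", "))
  }
  design_points[required]  # extra columns are dropped, not an error
}

dp <- data.frame(site_id = 1, lat = 38, lon = -121, pft = "crop", extra = "x")
checked <- validate_design_points(dp)
```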
R/local_sensitivity.R
Outdated
```r
)

# Process all files
all_results <- purrr::map_dfr(sa_files, function(sa_file) {
```
This will be much easier to read and understand if you define it as a function with an informative name -- yes, even if it's only called once as `all_results <- purrr::map_dfr(sa_files, name_of_function)`
I have extracted the file processing logic into a standalone helper function `process_sa_file`.
Thanks for the comments,
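A minimal sketch of that refactor (the helper body here is illustrative, not the real `process_sa_file`, which reads PEcAn sensitivity output files):

```r
# Named helper passed to map_dfr: the name documents intent even when the
# function is called only once.
library(purrr)

process_sa_file <- function(sa_file) {
  # In the real script this loads a PEcAn sensitivity output file;
  # here we just return a one-row summary for the sketch.
  data.frame(file = sa_file, n = nchar(sa_file))
}

sa_files <- c("site_a.Rdata", "site_b.Rdata")
all_results <- purrr::map_dfr(sa_files, process_sa_file)  # one row per file
```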
…rely as a consumer of the shared design_points.csv artifact
…t list extraction
```r
# Currently, this script consumes the 'design_points.csv' from the shared
# directory. At this stage of the project, these are MANUALLY SELECTED
# points (198 sites), not yet the output of the automated clustering
# workflow.
#
# TODO: Once the integration architecture between the Downscaling and
# Uncertainty repos is finalized, refactor this script to continue consuming
# the artifact generated by that pipeline.
# =======================================================================
```
dlebauer
left a comment
Overall, this is very nice work. The code is clear and easy to understand.
I've read through the scripts and R functions. My next step will be to review the report. I've made a few comments. Not all need to be done now, but please write tickets or todos to capture future work.
```yaml
pecan_xml_template: "data_raw/template.xml"
sites:
  design_points_file: "data_raw/sa_design_points.csv"
sensitivity:
```
The goal with this file was to handle settings that aren't in the pecan xml. Is there a reason that these are included here rather than the template.xml?
Minimizing config options here, and providing sensible defaults in the template.xml could make it more clear to end users.
```diff
@@ -8,8 +8,19 @@ default:
   raw_data_dir: "data_raw"
   cache_dir: "cache"
   pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
```
Once we have `ccmmf_dir`, can we change this to use that as a variable? That way it is only necessary to change the system-specific path once. (This also applies to `master_design_points`, etc.)
```diff
-pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
+pecan_outdir: "$ccmmf_dir/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
```
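One caveat: as far as I know the config package does not expand `$`-style variables on its own, so the interpolation would need a small post-processing step. A hedged sketch (the `cfg` values below are illustrative, not the repo's real config):

```r
# Sketch: expand a "$ccmmf_dir" placeholder after loading the config, so the
# system-specific base path only needs to change in one place.
cfg <- list(
  ccmmf_dir = "/projectnb2/dietzelab/ccmmf",
  pecan_outdir = "$ccmmf_dir/modelout/ccmmf_phase_2b_mixed_pfts_20250701",
  design_points = "$ccmmf_dir/data/design_points.csv"
)
resolve_paths <- function(cfg) {
  lapply(cfg, function(x) {
    if (is.character(x)) gsub("$ccmmf_dir", cfg$ccmmf_dir, x, fixed = TRUE) else x
  })
}
cfg <- resolve_paths(cfg)
```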
```yaml
raw_data_dir: "data_raw"
cache_dir: "cache"
pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
master_design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
```
nit: these are just design points ...
```diff
-master_design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
+design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
```
```r
cfg <- config::get(file = "000-config.yml")

# Define paths based on config
master_file <- cfg$paths$master_design_points
```
These may seem like trivial naming requests, but clarity and specificity are important.

- Why is this 'master_design_points'? That implies that there are other sets of design points.
- I would call this object `design_points_csv` instead of `master_file`.
- Instead of `master_data`, I would call that object `design_points`.
```r
  )
) |>
  dplyr::filter(!is.na(pft)) |>
  dplyr::slice_sample(n = n_sample) |>
```
why are we sampling here?
```r
# Prepares design points for Uncertainty Analysis.
#
# Currently, this script consumes the 'design_points.csv' from the shared
# directory. At this stage of the project, these are MANUALLY SELECTED
```
These were not manually selected; they came from the site selection workflow.
```r
#!/usr/bin/env Rscript

# =======================================================================
# 001_setup_design_points.R
```
The fact that this script exists makes me wonder whether the clustering workflow that generates design_points.csv should be updated - that will help ensure consistency in workflows that consume it.
```r
raw_data_dir <- cfg$paths$raw_data_dir

# SA configuration (configurable paths)
options <- list(
```
It is unclear why these configs are hard-coded here if they are already in the config file and/or template.xml.
```r
)
site_info <- site_info |> dplyr::rename(id = site_id)

# Read template settings
```
consider - what is the minimum amount of information that we want the end users to be able to configure, and why?
In addition to file paths, I can see n_sample, start_date, end_date being useful for being able to trigger 'development' mode.
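A sketch of what a minimal user-facing config could look like under that principle (keys and values are illustrative, not the repo's actual schema; the config package supports `inherits` for profile layering):

```yaml
# Hypothetical minimal 000-config.yml: paths plus a "development" profile
# that overrides only what is needed for a fast test run.
default:
  design_points_file: "data_raw/sa_design_points.csv"
  start_date: "2016-01-01"
  end_date: "2020-12-31"

development:
  inherits: default
  n_sample: 5             # small subset of sites for quick iteration
  end_date: "2016-12-31"  # one-year run
```

Loading with `config::get(config = "development")` would then merge these overrides onto the defaults.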
```r
  dir.create(settings$outdir, recursive = TRUE)
}

# Handle workflow resumption
```
is it worth the effort and overhead to manage a STATUS file? My recollection is that the STATUS file was developed primarily for the PHP web interface to display. It seems that dropping the status file would simplify the code a lot. What benefit does it provide, if any?
This is a context where targets might be useful.
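For comparison, here is a minimal base-R sketch of step-level resumption without a STATUS file (function and step names are illustrative); a {targets} pipeline generalizes the same idea with automatic dependency tracking:

```r
# Each step caches its result and is skipped when the cache already exists,
# which covers the resumption use case without manual state tracking.
run_step <- function(name, cache_dir, fun) {
  cache <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(cache)) {
    return(readRDS(cache))  # resume: reuse the stored result
  }
  result <- fun()
  saveRDS(result, cache)
  result
}

cache_dir <- tempfile("sa_cache_")
dir.create(cache_dir)
x <- run_step("design_points", cache_dir, function() data.frame(id = 1:3))
# A second invocation reads the cache; the function body is never called.
y <- run_step("design_points", cache_dir, function() stop("should not re-run"))
```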
dlebauer
left a comment
Overall the report is a nice analysis. A few changes could make the results easier to interpret.
- Read and implement "turning tables into graphs".
- Lead with a 3–5 bullet summary of the top findings (what parameters are most important to calibrate, and how these vary with environmental drivers).
- Display all of the elasticity plots at once using faceting.
- Show variance across sites in the elasticity plot.
- Show sites on a map similar to the one showing the design points in the downscaling repo.
- Replace the site table with a rank-distribution plot (median + IQR) and link to a downloadable CSV.
- Make gradient results a small multiple of partial dependence plots for the top 3 parameters, rather than a long table.
```markdown
Parameter rankings by mean absolute elasticity identify which inputs most strongly influence model outputs, averaged across all environmental conditions. High elasticity indicates the model is highly sensitive to that parameter.

::: panel-tabset
```
It would be helpful to be able to view all of these plots at the same time, either as panels or different color bars.
By default, I prefer to show medians, especially for non-normally distributed values. If they exist, consider showing the distribution of values (e.g., across sites)
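A sketch of that layout with toy data (not the report's real results): facets show every output at once, and boxplots give the median and IQR of elasticity across sites instead of a single bar per parameter.

```r
# Toy example: across-site elasticity distributions, all outputs faceted.
library(ggplot2)
set.seed(1)
toy <- expand.grid(
  parameter = paste0("p", 1:4),
  response_var = c("NPP", "SOC"),
  site = 1:20
)
toy$elasticity <- rnorm(nrow(toy))
p <- ggplot(toy, aes(x = parameter, y = elasticity)) +
  geom_boxplot() +             # median + IQR across sites
  facet_wrap(~ response_var) + # all outputs visible side by side
  coord_flip()
```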
```markdown
------------------------------------------------------------------------

# Site-by-Site Sensitivity {#sec-site-sensitivity}
```
The site-by-site table is difficult to interpret and doesn't add much once across-site variance is shown in the elasticity plot. Let's remove it. Andrew Gelman's "Let's practice what we preach: turning tables into graphs" makes a good case for visual summaries when tables get this dense: https://sites.stat.columbia.edu/gelman/research/published/dodhia.pdf
(The reason that I put a dense table of county aggregated means in the downscaling report is that those values are the numbers that the client wants)
A map would be a nice way to show the distribution of sites. It could eventually (not required for MVP) be interesting to map elasticity of the first, say, five parameters.
A few small usability notes for this style of table:
- When rendered as HTML, the table scrolls horizontally but headers don't stay aligned.
- Two significant figures are likely enough; additional precision is noise.
- The cell color bars are helpful for relative magnitude.
Consider something like a heat map. it isn't necessary to show every site, but perhaps summarize by environmental cluster?
See Dietze et al 2017 https://agupubs.onlinelibrary.wiley.com/doi/10.1002/2013JG002392
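A sketch of that heat map with toy data (cluster labels and parameter names are made up): summarizing elasticity by environmental cluster avoids a row per site, in the spirit of Dietze et al. 2017.

```r
# Toy cluster-level heat map of elasticity.
library(ggplot2)
set.seed(1)
toy <- expand.grid(parameter = paste0("p", 1:5), cluster = paste0("c", 1:4))
toy$elasticity <- rnorm(nrow(toy))
p <- ggplot(toy, aes(x = cluster, y = parameter, fill = elasticity)) +
  geom_tile() +
  scale_fill_gradient2()  # diverging scale keeps the sign of elasticity visible
```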
```markdown
# Technical Details {#sec-technical}

## Sensitivity Metrics
```
Link to this section from above, or perhaps move it to the beginning.
```markdown
- **Negative slope**: Sensitivity *decreases* along gradient
- **High** $R^2$: Strong environmental control on parameter importance

## Variance Explained Trends
```
It's not clear why there are only two unique values of $R^2$ in the table.
```markdown
:::

## Summary Statistics

Regression analysis tests whether parameter sensitivity systematically varies with environmental conditions, identifying context-dependent calibration needs.

## Regression Results
```
One way these could be displayed is using facets:

- each row a different parameter (for top params / greatest slopes)
- each column a different environmental covariate
- each plot has x = env covariate and y = elasticity. Could use color to plot multiple outputs (SOC, NPP, etc.) on the same plot, or create separate plots for each output.
Another note on plots: by default, keep the axes (and other scales) the same across different variations of a plot (e.g. all elasticity plots should share the same range). This makes it easier to compare. Sometimes it is necessary to break this rule, e.g. to show interesting trends that would otherwise be obscured.
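A sketch of that faceted layout with toy data (the covariate names MAT and MAP are placeholders): rows are parameters, columns are environmental covariates, and facet_grid's default shared scales keep panels comparable.

```r
# Toy example: elasticity vs. environmental covariate, faceted by
# parameter (rows) and covariate (columns), with a per-facet regression line.
library(ggplot2)
set.seed(1)
toy <- expand.grid(
  parameter = paste0("p", 1:3),
  covariate = c("MAT", "MAP"),
  site = 1:30
)
toy$env_value <- runif(nrow(toy))
toy$elasticity <- rnorm(nrow(toy))
p <- ggplot(toy, aes(x = env_value, y = elasticity)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  facet_grid(parameter ~ covariate)
```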
```r
parameter_rankings <- aggregated_results |>
  dplyr::group_by(parameter, response_var) |>
  dplyr::summarize(
    # Mean absolute elasticity (primary metric for sensitivity)
```
Taking the absolute value of elasticity makes sense when ranking parameter importance, but for plotting and regression, we should use untransformed elasticity.
There are lots of ways to plot this information, but this example is from LeBauer et al 2013 (shown here mostly to note that the direction of elasticity is also important information).
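A sketch of keeping both metrics (toy data; the column names follow the snippet above): rank by mean |elasticity|, but carry the signed mean forward for plotting and regression.

```r
# Toy example: compute both the absolute metric (for ranking) and the
# signed metric (for plots and regression) in one summarize call.
library(dplyr)
set.seed(1)
aggregated_results <- data.frame(
  parameter = rep(c("a", "b"), each = 10),
  response_var = "NPP",
  elasticity = rnorm(20)
)
parameter_rankings <- aggregated_results |>
  group_by(parameter, response_var) |>
  summarize(
    mean_abs_elasticity = mean(abs(elasticity)),  # for ranking importance
    mean_elasticity = mean(elasticity),           # signed, for plots/regression
    .groups = "drop"
  )
```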
Additional items for the "TODO" list.

(figure from LeBauer et al. 2013 omitted)
Summary:
Implements a complete multi-site local sensitivity analysis (SA) workflow for SIPNET as specified in #151. This PR delivers aggregated sensitivity results across design points, environmental gradient analysis, and a Quarto-based report summarizing parameter importance patterns.
Addresses Issue #151 Requirements